import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load the specific sheet named 'Webinar Leads' from the Excel file
df = pd.read_excel('US Pathway Complete Data.xlsx', sheet_name='campaign_performance')
# Display the first few rows of the dataframe
df.head()
| dates | campaign_name | campaign_start_date | creative_name | total_spent | impressions | clicks | click_through_rate | leads | platform | adset_name | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-04-29 | USP_Search_APTNTSKA_250124 | NaT | -- | 6766.15 | 1250 | 220 | 0.176000 | 25.0 | GMAT/GRE | |
| 1 | 2024-05-02 | USP_2_Lead-gen-2_Top4-states_050424 | NaT | #7 ( Graphic ) - Get A Free Pass | 946.38 | 3154 | 9 | 0.285352 | 1.0 | USP-2_KA_TN_050424 | |
| 2 | 2024-05-02 | USP_2_Lead-gen-2_Top4-states_050424 | NaT | USP_S3_never-been-easier | 170.40 | 2695 | 8 | 0.296846 | 0.0 | USP-2_KA_TN_050424 | |
| 3 | 2024-05-02 | USP_2_Lead-gen-2_Top4-states_050424 | NaT | USP_S4_Do-you-have | 55.65 | 577 | 2 | 0.346620 | 0.0 | USP-2_KA_TN_050424 | |
| 4 | 2024-05-02 | USP_2_Lead-gen-2_Top4-states_050424 | NaT | USP_S1_We-guarantee | 296.94 | 1466 | 10 | 0.682128 | 1.0 | USP-2_AP_TS_050424 |
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 28534 entries, 0 to 28533 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dates 28534 non-null datetime64[ns] 1 campaign_name 28534 non-null object 2 campaign_start_date 4247 non-null datetime64[ns] 3 creative_name 28534 non-null object 4 total_spent 28534 non-null float64 5 impressions 28534 non-null int64 6 clicks 28534 non-null int64 7 click_through_rate 28497 non-null float64 8 leads 28534 non-null float64 9 platform 28534 non-null object 10 adset_name 24287 non-null object dtypes: datetime64[ns](2), float64(3), int64(2), object(4) memory usage: 2.4+ MB
# Check the data type of 'leads' before conversion
print(df['leads'].dtype)
# Check for non-numeric values and NaNs
print(df['leads'].unique())
# Fill NaNs with 0 or another value if necessary
df['leads'] = df['leads'].fillna(0)
# Convert 'leads' to integer
df['leads'] = df['leads'].astype(np.int64)
# Check the data type of 'leads' after conversion
print(df['leads'].dtype)
float64 [ 25. 1. 0. 6. 4. 2. 8. 3. 11. 7. 14. 5. 10. 15. 20. 28. 9. 16. 23. 22. 18. 38. 31. 16.91 40. 13. 12. 19. 6.28 36. 4.09 51. 17. 33. 37. 24. 30. 46. 45. 26. 21.68 8.76 21. 128. 29. 13.24 43. 15.5 11.48 2.69 14.98 7.52 27. 32. 4.5 0.18 9.36 72. 39. 34. 0.98 52. 50. 59. 66. 2.03 7.5 55. 3.5 1.78 35. 73. 9.5 1.19 6.81 10.86 12.14 103. 53. 0.5 1.82 23.18 2.64 3.64 3.67 19.83 44. 8.13 12.68 3.52 41. 5.33 0.97 2.61 17.66 153. 16.36 81. 80. 5.5 29.22 8.16 64. 25.67 17.56 13.68 54. 114. 4.29 1.37 3.88 11.63 98. 5.78 14.5 42. 14.89 3.87 0.64 2.52 21.5 1.5 1.64 1.36] int64
Leads will be mostly the count of people ,so the datatype must be in the form of integer.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 28534 entries, 0 to 28533 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dates 28534 non-null datetime64[ns] 1 campaign_name 28534 non-null object 2 campaign_start_date 4247 non-null datetime64[ns] 3 creative_name 28534 non-null object 4 total_spent 28534 non-null float64 5 impressions 28534 non-null int64 6 clicks 28534 non-null int64 7 click_through_rate 28497 non-null float64 8 leads 28534 non-null int64 9 platform 28534 non-null object 10 adset_name 24287 non-null object dtypes: datetime64[ns](2), float64(2), int64(3), object(4) memory usage: 2.4+ MB
df['creative_name'].unique()
array([' --', '#7 ( Graphic ) - Get A Free Pass',
'USP_S3_never-been-easier', 'USP_S4_Do-you-have',
'USP_S1_We-guarantee', 'USP_kavita&steffi_VideoAd6',
'#8 ( Graphic ) - Study In The USA', 'USP_S2_Faced-rejection',
'USP_S5_easy-for-you', 'State University Of New York_New',
'USP_RG_VideoAd5', '#15 ( Graphic ) - Dual coursework_Plain',
"#18 - Master's Degree In 12 Months – Rutgers",
'#16 ( Graphic ) - Dual coursework_Hands',
"#6 ( Graphic ) - Master's Degree In 12 Months",
'Tomorrow_Static-6', 'USP_shreya&steffi_VideoAd8',
'USP_Drexel_1_EWYL', 'USP_shreya_VideoAd7',
'USP_Drexel_EWYL_new-Webinar', '15-days left_Static-1',
'#4 ( Graphic ) - How To Get Admmission In Your Dream Universities In US',
"#5 ( Graphic ) - Get Master's Degree in Just 12 Months",
'less than a week_Static-3', 'USP_Drexel_Repay_new-Webinar',
'USP_Drexel_EM_new-Webinar', 'USP_S12_Rutgers2',
'USP_S14_Akron_13L', 'USP_S13_Akron', 'USP_S11_Rutgers',
'USP_S10_Rider_13L', 'USP_S9_Stony_25L', '10-days left_Static-2',
'USP_S6_SeattleU', '5 days left_Static-5', 'Last chance_Static-4',
'USP_S8_Rider', 'USP_RIT_1_Getpaid', 'USP_RIT_2_EWYL_INR',
'USP_S7_Stony', 'USP_Drexel_EWYL_new', 'USP_Drexel_EM_new',
'USP_Drexel_Repay_new', 'USP_Webinar_without_GMAT/GRE',
'USP_Program_Linkedin_Static_C2_Cost-Generic',
'USP_We-guarantee_C32', 'USP_never-been-easier_C34',
'USP_Program_Linkedin_Static_C12_CW-3YWP', 'USP_easy-for-you_C36',
'USP_RG_webinar_V1', 'USP_Do-you-have_C35',
'USP_Program_Linkedin_Static_C11_SB-3YWP',
'USP_Program_Linkedin_Static_C28_CWRU_3YWP',
'USP_Program_Linkedin_Static_C3_CWRU-Cost',
'USP_Program_Linkedin_Static_C13_Notepad',
'USP_Program_Linkedin_Static_C14_Bridge-3YWP',
'USP_Program_Linkedin_Static_C19_Dual-Course-white',
'USP_Shreya_webinar_V1', 'USP_Influencer_videoAd4',
'Steffi and Shreya', 'USP_Program_Linkedin_Static_C5_Rutgers-Cost',
'USP_Program_Linkedin_Static_C31_2.25L',
'#2 ( Graphic ) - State University Of New York',
'#3 ( Graphic ) - Worried About GMAT Score?',
'#1 ( Graphic ) - Want To Build Your Career In The US?',
'USP_Faced-rejection_C33',
'USP_Program_Linkedin_Static_C9_Avg-Salary',
'USP_Webinar_Saturdayfuture',
"#11 ( Graphic ) - Master's Degree In 12 Months_CWRU",
'USP_RG_webinar_V4_Namita', '#9 ( Graphic ) - Study In The US',
'#10 ( Graphic ) - Get Into Top US Universitie',
'#17 ( Graphic ) 2.25L',
'USP_Program_Linkedin_Static_C7_Drexel-Cost',
'USP_Influencer_videoAd2', 'USP_Influencer_videoAd3',
'USP_Influencer_videoAd1', '#9 ( Graphic ) - Study In The USA',
'USP_Program_Linkedin_Static_C17_SUNY-Cost-white',
'USP_Program_Linkedin_Static_C15_RIT-Cost-white',
'USP_Program_Linkedin_Static_C27_Dual-coursework_Hands',
'#7 ( Graphic ) - Get a free pass',
'#8 ( Graphic ) - Study in The USA',
'USP_Program_Linkedin_Static_C24_SB-22L',
'USP_Program_Linkedin_Static_C4_RIT-Cost',
'#9 ( Graphic ) - Study in The USA',
'USP_Program_Linkedin_Static_C29_Rutgres_3YWP', 'Shreya',
'USP_Program_Linkedin_Static_C23_Rutgers-33K', 'RG Video',
'USP_Program_Linkedin_Static_C1_Without-Any-Barriers',
'USP_Program_Linkedin_Static_C25_Tulane-19L',
'USP_Program_Linkedin_Static_C8_Dual-Course',
'USP_Program_Linkedin_Static_C30_Tulane_3YWP',
'USP_Fairfield-MBA_2', 'USP_RG_webinar_V2', 'Kavita and Steffi',
'USP_Program_Linkedin_Static_C22_Drexel-27L',
'USP_Program_Linkedin_Static_C26_Dual-coursework_plain',
'USP_RG_webinar_V3_Mayank',
'USP_Program_Linkedin_Static_C16_Rutgers-Cost-white',
'USP_Drexel_6_MSCE_USD',
'USP_Program_Linkedin_Static_C10_SB-Savings',
'USP_Webinar_without_GMAT/GRE - Copy', 'USP_Generic-MBA_3',
'#7 ( Graphic ) - Get free pass',
'USP_Program_Linkedin_Static_C21_CWRU-22L',
'USP_Webinar_SaturdayPlan_1', 'USP_Webinar_MSunder12Months - Copy',
'USP_Webinar_MSunder12Months', 'Influencer 3', 'Influencer 4',
"#14 ( Graphic ) - Master's Degree In 12 Months_Drexel",
'USP_Webinar_SaturdayPlan_1 – Copy',
'USP_Webinar_SaturdayPlan_1 – Copy 2', 'USP_RIT_2_Repay',
"#13 ( Graphic ) - Master's Degree In 12 Months_Tulane",
"#12 ( Graphic ) - Master's Degree In 12 Months_Rutgers",
'USP_Drexel_5_MSIOT_INR', 'USP_Drexel_2_EM', 'USP_Drexel_4_Repay',
'USP_Drexel_8_MSIOT_USD', 'USP_Drexel_7_MSCE_INR',
'USP_Drexel_3_Getpaid', 'USP_Program_Linkedin_Static_C6_SUNY-Cost',
'USP_Webinar_Saturdayfuture - Copy', 'Influencer 1',
'USP_RIT_1_Repay', 'USP_Drexel-MBA_1',
'USP_Fairfield_MS-Marketing-2_5', 'USP_RIT_2_Getpaid',
'USP_RIT_3_EWYL_INR', 'USP_RIT_4_EWYL_USD',
'USP_Fairfield_MS-Marketing_4',
'#9 ( Graphic ) - Study In The US – Copy',
'USP_Program_Linkedin_Static_C20_CWRU-Cost-white',
'USP_Program_Linkedin_Static_C20_CWRU-Cost-White',
'USP_Program_Linkedin_Static_C18_Drexel-Cost-white',
"#6 ( Graphic ) - Master's Degree In 12 Months – Copy",
'#1 ( Graphic ) - Want To Build Your Career In The US? – Copy',
'#3 ( Graphic ) - Worried About GMAT Score? – Copy',
'#8 ( Graphic ) - Study In The USA – Copy',
'#2 ( Graphic ) - State University Of New York – Copy',
'Influencer 2', "#19 - Master's Degree In 12 Months – CW",
'USP_Drexel_1_Getpaid'], dtype=object)
Based on our observation in Power BI, we noticed that the campaign names for two specific campaigns (USP_Search_APTNTSKA_250124 and USP_Search_India-ex-APTNTSKA_250124) were displayed as '--'. Given the importance of maintaining data integrity and the inability to accurately assign them as campaign names or adset names, we recommend leaving these entries as they are. This approach will ensure the accuracy and consistency of the dataset without introducing any assumptions or errors.
df.isnull().sum()
dates 0 campaign_name 0 campaign_start_date 24287 creative_name 0 total_spent 0 impressions 0 clicks 0 click_through_rate 37 leads 0 platform 0 adset_name 4247 dtype: int64
# Drop the 'campaign_start_date' column
df = df.drop(columns=['campaign_start_date'])
We have removed the campaign_start_date column from the dataset because it contained a significant number of null values, making it impractical to fill in accurately. Since the column was not useful for our Exploratory Data Analysis (EDA), omitting it will streamline our analysis and focus on more relevant data.
df.isnull().sum()
dates 0 campaign_name 0 creative_name 0 total_spent 0 impressions 0 clicks 0 click_through_rate 37 leads 0 platform 0 adset_name 4247 dtype: int64
# Fill null values with 0 in the 'click_through_rate' column
df['click_through_rate'] = df['click_through_rate'].fillna(0)
We filled the null values in the click_through_rate column with '0'. This decision was based on our observation in Power BI that these rows had no clicks. Therefore, it is logical and accurate to set the click_through_rate to '0' for those entries. This ensures the integrity of the dataset and allows for more accurate analysis.
df.isnull().sum()
dates 0 campaign_name 0 creative_name 0 total_spent 0 impressions 0 clicks 0 click_through_rate 0 leads 0 platform 0 adset_name 4247 dtype: int64
# Fill null values in 'adset_name' with '--'
df['adset_name'] = df['adset_name'].fillna('--')
The null values in the adset_names column are due to the absence of ad set names for all campaigns on LinkedIn. Therefore, we cannot fill these values with creative_name or campaign_name. Instead, we will retain them as '--' to maintain the accuracy and consistency of the dataset.
df.isnull().sum()
dates 0 campaign_name 0 creative_name 0 total_spent 0 impressions 0 clicks 0 click_through_rate 0 leads 0 platform 0 adset_name 0 dtype: int64
All null values were cleared.
df.head()
| dates | campaign_name | creative_name | total_spent | impressions | clicks | click_through_rate | leads | platform | adset_name | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-04-29 | USP_Search_APTNTSKA_250124 | -- | 6766.15 | 1250 | 220 | 0.176000 | 25 | GMAT/GRE | |
| 1 | 2024-05-02 | USP_2_Lead-gen-2_Top4-states_050424 | #7 ( Graphic ) - Get A Free Pass | 946.38 | 3154 | 9 | 0.285352 | 1 | USP-2_KA_TN_050424 | |
| 2 | 2024-05-02 | USP_2_Lead-gen-2_Top4-states_050424 | USP_S3_never-been-easier | 170.40 | 2695 | 8 | 0.296846 | 0 | USP-2_KA_TN_050424 | |
| 3 | 2024-05-02 | USP_2_Lead-gen-2_Top4-states_050424 | USP_S4_Do-you-have | 55.65 | 577 | 2 | 0.346620 | 0 | USP-2_KA_TN_050424 | |
| 4 | 2024-05-02 | USP_2_Lead-gen-2_Top4-states_050424 | USP_S1_We-guarantee | 296.94 | 1466 | 10 | 0.682128 | 1 | USP-2_AP_TS_050424 |
# Set the style for the plots
sns.set(style="whitegrid")
# List of numerical columns to plot
columns_to_plot = ['total_spent', 'impressions', 'clicks', 'leads']
# Create a figure and axes for subplots
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(18, 12))
axes = axes.flatten()
# Plot box plots for each numerical column
for i, column in enumerate(columns_to_plot):
sns.boxplot(data=df, x=column, ax=axes[i])
axes[i].set_title(f'Box Plot of {column}')
axes[i].set_xlabel(column)
plt.tight_layout()
plt.show()
Based on our analysis of the data and graphs, we observed a skewed distribution, indicating the presence of outliers. Although the Interquartile Range (IQR) method is a standard approach for handling outliers, applying it in this case would result in the removal of a significant portion of the data. Considering the client's requirements and the necessity to retain all data points for comprehensive analysis, we recommend not removing any outliers in this dataset. This approach will ensure that the integrity of the dataset is maintained while fulfilling the client's needs.
# Dates
plt.figure(figsize=(10, 6))
sns.histplot(df['dates'], bins=30, kde=True)
plt.title('Distribution of Dates')
plt.xlabel('Dates')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()
Distribution of dates across the dataset.we can see that these were mostly high in the months of December,February and April.
Distribution of Numerical columns
# Set the plot style
sns.set(style="whitegrid")
# Plot the distribution of total_spent
plt.figure(figsize=(10, 6))
sns.histplot(df['total_spent'], bins=30, kde=True)
plt.title('Distribution of Total Spent')
plt.xlabel('Total Spent')
plt.ylabel('Frequency')
plt.show()
print("Mean : ",df['total_spent'].mean())
print("Median : ",df['total_spent'].median())
print("Min : ",df['total_spent'].min())
print("Max : ",df['total_spent'].max())
Mean : 554.6046712693629 Median : 53.285 Min : 0.0 Max : 27125.36
This skewness towards lower total spent suggests that most of the campaigns or data points in our dataset have relatively fewer total spent, with a concentration of values between 100 and 1000.
# Plot the distribution of impressions
plt.figure(figsize=(10, 6))
sns.histplot(df['impressions'], bins=30, kde=True)
plt.title('Distribution of Impressions')
plt.xlabel('Impressions')
plt.ylabel('Frequency')
plt.show()
print("Mean : ",df['impressions'].mean())
print("Median : ",df['impressions'].median())
print("Min : ",df['impressions'].min())
print("Max : ",df['impressions'].max())
Mean : 2381.018399102825 Median : 107.0 Min : 0 Max : 196585
This skewness towards lower impressions suggests that most of the campaigns or data points in our dataset have relatively fewer impressions, with a concentration of values between 100 and 1000.
# Plot the distribution of clicks
plt.figure(figsize=(10, 6))
sns.histplot(df['clicks'], bins=30, kde=True)
plt.title('Distribution of Clicks')
plt.xlabel('Clicks')
plt.ylabel('Frequency')
plt.show()
print("Mean : ",df['clicks'].mean())
print("Median : ",df['clicks'].median())
print("Min : ",df['clicks'].min())
print("Max : ",df['clicks'].max())
Mean : 18.770624518118737 Median : 1.0 Min : 0 Max : 3361
This skewness towards lower clicks suggests that most of the campaigns or data points in our dataset have relatively fewer clicks, with a concentration of values between 0 and 100.
# Plot the distribution of leads
plt.figure(figsize=(10, 6))
sns.histplot(df['leads'], bins=30, kde=True)
plt.title('Distribution of Leads')
plt.xlabel('Leads')
plt.ylabel('Frequency')
plt.show()
print("Mean : ",df['leads'].mean())
print("Median : ",df['leads'].median())
print("Min : ",df['leads'].min())
print("Max : ",df['leads'].max())
Mean : 1.1915258989275952 Median : 0.0 Min : 0 Max : 153
This skewness towards lower leads suggests that most of the campaigns or data points in our dataset have relatively fewer impressions, with a concentration of values between 0 and 5.We should improve the campaigns with '0' leads or spend less on them.
Distribution of categorical columns
# Set the style for the plot
sns.set(style="whitegrid")
# Create a figure for the plot
plt.figure(figsize=(10, 6))
# Plot the distribution of the 'platform' column
sns.countplot(y=df['platform'], order=df['platform'].value_counts().index)
# Set the title and labels
plt.title('Distribution of Platform')
plt.xlabel('Count')
plt.ylabel('Platform')
# Show the plot
plt.show()
The dataset shows that most of the data points are from Facebook.This suggests that Facebook dominates the dataset, potentially influencing the overall trends and insights derived from the data.
# Set the plot style
sns.set(style="whitegrid")
# Plot the distribution of campaign_name
plt.figure(figsize=(14, 10))
campaign_name_order = df['campaign_name'].value_counts().index
ax = sns.countplot(y=df['campaign_name'], order=campaign_name_order)
plt.title('Distribution of Campaign Names', fontsize=16)
plt.xlabel('Frequency', fontsize=14)
plt.ylabel('Campaign Name', fontsize=14)
# Rotate the labels for better visibility
ax.set_yticklabels(ax.get_yticklabels(), rotation=0, fontsize=13)
plt.show()
The distribution of campaign names indicates that campaigns with more data points generally have higher reach. This is inferred from the fact that campaigns with a larger number of impressions tend to have a higher frequency of occurrences in the dataset. Therefore, campaigns with more data points are likely to have achieved broader reach or engagement.
# Select the columns to analyze
columns_to_analyze = ['impressions', 'clicks', 'leads','total_spent']
# Compute the correlation matrix
correlation_matrix = df[columns_to_analyze].corr()
# Plot the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5, vmin=-1, vmax=1)
plt.title('Correlation Heatmap between Impressions, Clicks, and Leads')
plt.show()
The correlation matrix provides a rough overview of the relationships between the columns in our dataset. Eg.Impressions and cliks have strong +ve correlation,clicks and leads have weak +ve correlation,impressions and leads have moderately +ve correlation,impressions and total_spent have moderately +ve correlation,clicks and total_spent have moderate +ve correlation,leads and total_spent have moderate +ve correlation.
# Define the pairs of columns to create scatter plots for
pairs = [('impressions', 'clicks'),
('impressions', 'leads'),
('impressions', 'total_spent'),
('clicks', 'leads'),
('clicks', 'total_spent'),
('leads', 'total_spent')]
# Create a 3x2 grid of scatter plots
fig, axes = plt.subplots(3, 2, figsize=(15, 18))
for ax, (x, y) in zip(axes.flatten(), pairs):
sns.scatterplot(x=df[x], y=df[y], ax=ax)
ax.set_title(f'{x.capitalize()} vs. {y.capitalize()}')
ax.set_xlabel(x.capitalize())
ax.set_ylabel(y.capitalize())
plt.tight_layout()
plt.show()
By visualizing these relationships, the scatterplots provide a graphical confirmation of the correlations observed in the matrix, offering a more intuitive understanding of the data dynamics.
df.head()
| dates | campaign_name | creative_name | total_spent | impressions | clicks | click_through_rate | leads | platform | adset_name | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-04-29 | USP_Search_APTNTSKA_250124 | -- | 6766.15 | 1250 | 220 | 0.176000 | 25 | GMAT/GRE | |
| 1 | 2024-05-02 | USP_2_Lead-gen-2_Top4-states_050424 | #7 ( Graphic ) - Get A Free Pass | 946.38 | 3154 | 9 | 0.285352 | 1 | USP-2_KA_TN_050424 | |
| 2 | 2024-05-02 | USP_2_Lead-gen-2_Top4-states_050424 | USP_S3_never-been-easier | 170.40 | 2695 | 8 | 0.296846 | 0 | USP-2_KA_TN_050424 | |
| 3 | 2024-05-02 | USP_2_Lead-gen-2_Top4-states_050424 | USP_S4_Do-you-have | 55.65 | 577 | 2 | 0.346620 | 0 | USP-2_KA_TN_050424 | |
| 4 | 2024-05-02 | USP_2_Lead-gen-2_Top4-states_050424 | USP_S1_We-guarantee | 296.94 | 1466 | 10 | 0.682128 | 1 | USP-2_AP_TS_050424 |
# Convert 'dates' to datetime format if not already done
df['dates'] = pd.to_datetime(df['dates'])
# Setting plot style
sns.set(style="whitegrid")
# Plotting each metric over time in separate subplots in a 2x2 grid
fig, axs = plt.subplots(2, 2, figsize=(14, 12))
# Total Spent
sns.lineplot(ax=axs[0, 0], x='dates', y='total_spent', data=df, marker='o', color='b')
axs[0, 0].set_title('Total Spent Over Time')
axs[0, 0].set_xlabel('Date') # Added x-axis label
axs[0, 0].set_ylabel('Total Spent')
axs[0, 0].grid(True)
# Impressions
sns.lineplot(ax=axs[0, 1], x='dates', y='impressions', data=df, marker='o', color='orange')
axs[0, 1].set_title('Impressions Over Time')
axs[0, 1].set_xlabel('Date') # Added x-axis label
axs[0, 1].set_ylabel('Impressions')
axs[0, 1].grid(True)
# Clicks
sns.lineplot(ax=axs[1, 0], x='dates', y='clicks', data=df, marker='o', color='green')
axs[1, 0].set_title('Clicks Over Time')
axs[1, 0].set_xlabel('Date') # Added x-axis label
axs[1, 0].set_ylabel('Clicks')
axs[1, 0].grid(True)
# Leads
sns.lineplot(ax=axs[1, 1], x='dates', y='leads', data=df, marker='o', color='red')
axs[1, 1].set_title('Leads Over Time')
axs[1, 1].set_xlabel('Date') # Added x-axis label
axs[1, 1].set_ylabel('Leads')
axs[1, 1].grid(True)
# Adjusting layout to avoid overlap
plt.tight_layout()
plt.show()
Distribution of all the key metrics numerical columns according to the dates or time period using line plot ,and gives the observation of relations between them. We can see they mostly aligned.
plt.figure(figsize=(14, 6))
sns.lineplot(x='dates', y='click_through_rate', data=df, marker='o', color='purple')
plt.title('Click-Through Rate (CTR) Over Time')
plt.xlabel('Date')
plt.ylabel('CTR')
plt.grid(True)
plt.show()
# Convert 'dates' to datetime format if not already done
df['dates'] = pd.to_datetime(df['dates'])
# Determine the top 4 creative_names by average click_through_rate
top_creative_names = df.groupby('creative_name')['click_through_rate'].mean().nlargest(4).index
# Filter the dataframe to only include the top 4 creative_names
df_top4 = df[df['creative_name'].isin(top_creative_names)]
# Setting plot style
sns.set(style="whitegrid")
# Plotting click_through_rate over time, faceted by creative_name
g = sns.FacetGrid(df_top4, col="creative_name", col_wrap=2, height=4, aspect=1.5)
g.map(sns.lineplot, "dates", "click_through_rate", marker="o")
# Adding titles and labels
g.set_titles("{col_name}")
g.set_axis_labels("Date", "Click Through Rate")
for ax in g.axes.flat:
for label in ax.get_xticklabels():
label.set_rotation(45)
# Adjusting layout to avoid overlap
plt.tight_layout()
plt.show()
The line plots show the click-through rates over time for the top 4 creative names. Each subplot reveals the performance trends for these creatives, highlighting periods of higher or lower click-through rates and allowing for comparison between them. This visualization helps identify which creative names perform consistently well or show notable fluctuations.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 28534 entries, 0 to 28533 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dates 28534 non-null datetime64[ns] 1 campaign_name 28534 non-null object 2 creative_name 28534 non-null object 3 total_spent 28534 non-null float64 4 impressions 28534 non-null int64 5 clicks 28534 non-null int64 6 click_through_rate 28534 non-null float64 7 leads 28534 non-null int64 8 platform 28534 non-null object 9 adset_name 28534 non-null object dtypes: datetime64[ns](1), float64(2), int64(3), object(4) memory usage: 2.2+ MB
# Plot total_spent by adset_name
plt.figure(figsize=(14, 6))
sns.barplot(x='campaign_name', y='total_spent', data=df, palette='viridis')
plt.title('Total Spent by campaign')
plt.xlabel('Campaign Name')
plt.ylabel('Total Spent')
plt.xticks(rotation=90)
plt.show()
This analysis displays the campaigns with the highest expenditure. By examining these plots, we can identify which campaigns have the greatest financial investment, indicating where the most money is spent.
# Plot leads by campaign_name
plt.figure(figsize=(14, 6))
sns.barplot(x='campaign_name', y='leads', data=df, palette='magma')
plt.title('Leads by Campaign Name')
plt.xlabel('Campaign Name')
plt.ylabel('Leads')
plt.xticks(rotation=90)
plt.show()
This analysis shows the campaigns and their associated leads. By visualizing this data, we can identify which campaigns are generating the most leads and assess their effectiveness in driving lead generation.
# Aggregate data by campaign name and calculate total spent, impressions, clicks, and leads
campaign_metrics1 = df.groupby(['campaign_name','platform'])[['total_spent', 'impressions', 'clicks', 'leads']].sum().reset_index()
# Sort the data by leads in descending order and get the top 10 campaigns
top_10_campaigns = campaign_metrics1.sort_values(by='leads', ascending=False).head(10)
# Plot the top 10 campaigns by leads
plt.figure(figsize=(14, 8))
sns.barplot(x='leads', y='campaign_name', data=top_10_campaigns, palette='viridis')
plt.title('Top 10 Campaigns by Leads')
plt.xlabel('Leads')
plt.ylabel('Campaign Name')
plt.show()
The analysis highlights the top 10 campaigns that generated the most leads and their distribution. By examining this data, we can identify which campaigns were most successful in lead generation and observe how leads are distributed across these top-performing campaigns.As their total spent is mostly alligned with the leads.
# Display details of the top 10 campaigns
print("Details of Top 10 Campaigns by Leads:")
print(top_10_campaigns)
Details of Top 10 Campaigns by Leads:
campaign_name platform total_spent \
9 USP Study Abroad_Generic_240224 Facebook 2126684.24
13 USP_MS + Coop Campaign_240224 Facebook 1501040.54
14 USP_Marketing+MBA_New Leads Campaign_240224 Facebook 1125425.52
10 USP_2_Lead-gen-1_220124 Facebook 484768.66
32 USP_Video_More_Volume_No Logic_India_New Link Facebook 198416.23
34 USP_Webinar_070324 Facebook 752657.01
24 USP_Search_APTNTSKA_250124 Google 789571.12
0 CBO - USP Masterclass Cold [ 15th Dec ] - High... Facebook 186700.48
35 US_Pathway_2_Lead_gen_1_091123 LinkedIn 1884063.88
12 USP_COOP-Drexel_More_Volume_No Logic_India_240124 Facebook 126657.36
impressions clicks leads
9 9265548 48531 4630
13 4506382 27762 4446
14 8106285 37727 2390
10 1668181 10480 2124
32 1246514 12420 1866
34 4125312 16957 1616
24 178830 21663 1358
0 610938 12853 1346
35 5751255 34923 1232
12 395467 5238 1096
This information provides a comprehensive view of the performance of the top campaigns, showing how investment (total spend) correlates with engagement (clicks) and effectiveness (leads).
# Group by adset_name and sum the total_spent
total_spent_by_adset = df.groupby('adset_name')['total_spent'].sum().reset_index()
# Plotting the total_spent by adset_name
plt.figure(figsize=(14, 6))
sns.barplot(x='adset_name', y='total_spent', data=total_spent_by_adset, palette='viridis')
plt.title('Total Spent by Adset Name')
plt.xlabel('Adset Name')
plt.ylabel('Total Spent')
plt.xticks(rotation=90)
plt.show()
The bar chart displays total spend across different ad sets, highlighting which ad sets received the most investment and allowing for easy comparison of spending levels.
# Plot impressions by adset_name
plt.figure(figsize=(14, 6))
sns.barplot(x='adset_name', y='impressions', data=df, palette='viridis')
plt.title('Impressions by Adset')
plt.xlabel('Adset Name')
plt.ylabel('Impressions')
plt.xticks(rotation=90)
plt.show()
The chart shows the distribution of impressions across different ad sets, revealing which ad sets generated the most impressions and allowing for comparison of their reach.The chart reveals variations in reach, helping identify which ad sets were more effective in capturing audience attention.
# Plot leads by adset_name
plt.figure(figsize=(14, 6))
sns.barplot(x='adset_name', y='leads', data=df, palette='viridis')
plt.title('Leads by Adset')
plt.xlabel('Adset Name')
plt.ylabel('Leads')
plt.xticks(rotation=90)
plt.show()
The chart shows the distribution of leads across different ad sets, highlighting which ad sets generated the most leads. It allows for comparison of lead generation performance between ad sets, revealing which ones were most effective at converting impressions into leads. This insight can help evaluate the effectiveness of various ad sets and inform strategies to optimize lead generation efforts.
# Aggregate data by adset_name and calculate total spent, impressions, clicks, and leads
adset_metrics = df.groupby(['adset_name', 'campaign_name','platform'])[['total_spent', 'impressions', 'clicks', 'leads']].sum().reset_index()
# Sort the data by leads in descending order and get the top 20 ad sets
top_20_adsets = adset_metrics.sort_values(by='leads', ascending=False).head(20)
# Plot the top 20 ad sets by leads
plt.figure(figsize=(14, 8))
sns.barplot(x='leads', y='adset_name', data=top_20_adsets, palette='viridis')
plt.title('Top 20 Ad Sets by Leads')
plt.xlabel('Leads')
plt.ylabel('Ad Set Name')
plt.xticks(rotation=90)
plt.show()
The chart displays the top 20 ad sets ranked by lead generation, showcasing which ad sets were the most effective at producing leads. It enables a clear comparison of lead performance among these top ad sets and helps identify which ones delivered the best results. This insight can guide future marketing strategies and optimize lead generation efforts.
# Display details of the top 20 ad sets
print("Details of Top 20 Ad Sets by Leads:")
print(top_20_adsets)
Details of Top 20 Ad Sets by Leads:
adset_name \
73 MS + Coop All India Ad set - 240224
5 All India Counselling Look-a-like_290224
116 USP-2_Data_Job-title_India_exclude_KATNTSAP
76 Marketing+MBA All India - 240224
8 All India Generic Leads Ad Set_240224
74 MS + Coop KA&TN Ad set - 240224
90 Open Targeting - Advantage+ audience [ 7th Dec ]
0 --
55 GMAT/GRE
11 All India Study abroad Interest 070324
81 Open Targeting - Advantage+ audience
2 --
72 MS + Coop AP&TS Ad set - 240224
1 --
86 Open Targeting - Advantage+ audience [ 31st Oct ]
83 Open Targeting - Advantage+ audience [ 23rd Nov ]
13 All India focusing on Uni Ad Set_060324
19 Custom-Intent_Targeting
117 USP-2_Data_Job-title_KATNTSAP
77 Marketing+MBA KA&TN - 240224
campaign_name platform total_spent \
73 USP_MS + Coop Campaign_240224 Facebook 647685.15
5 USP Study Abroad_Generic_240224 Facebook 535064.20
116 USP_2_Lead-gen-1_220124 Facebook 308497.83
76 USP_Marketing+MBA_New Leads Campaign_240224 Facebook 551614.48
8 USP Study Abroad_Generic_240224 Facebook 647092.99
74 USP_MS + Coop Campaign_240224 Facebook 478504.92
90 CBO - USP Masterclass Cold [ 15th Dec ] - High... Facebook 184929.95
0 US_Pathway_2_Lead_gen_1_091123 LinkedIn 1884063.88
55 USP_Search_APTNTSKA_250124 Google 689360.86
11 USP_Webinar_070324 Facebook 480984.59
81 USP_Video_More_Volume_No Logic_India_New Link Facebook 55359.97
2 US_Pathway_2_Lead_gen_3_241123 LinkedIn 1749087.89
72 USP_MS + Coop Campaign_240224 Facebook 374850.47
1 US_Pathway_2_Lead_gen_2_091123 LinkedIn 1469168.59
86 CBO - USP Masterclass Cold [ 23rd Nov ] - More... Facebook 244398.01
83 CBO - USP Masterclass Cold [ 23rd Nov ] - High... Facebook 95114.56
13 USP Study Abroad_Generic_240224 Facebook 318038.10
19 USP_Video_APTLTSKA_Lead-Gen_Custom-intent_050324 Google 155920.03
117 USP_2_Lead-gen-1_220124 Facebook 176270.83
77 USP_Marketing+MBA_New Leads Campaign_240224 Facebook 319277.52
impressions clicks leads
73 1901299 13861 2341
5 1516258 10984 1884
116 1011413 6932 1505
76 4427087 23048 1486
8 3341182 16973 1375
74 1346976 8000 1352
90 599079 12767 1338
0 5751255 34923 1232
55 161249 19360 1100
11 2839529 11106 1028
81 308755 5296 950
2 6317435 36201 874
72 1258107 5901 753
1 4225764 26627 723
86 818847 10820 710
83 429447 6682 710
13 1420202 8710 674
19 1220168 25558 623
117 656768 3548 619
77 1959033 7984 562
This details the top 20 ad sets by lead generation, including key metrics such as campaign name, impressions, clicks, and leads. This comprehensive view highlights which ad sets were most effective in generating leads and provides a comparison of their performance across various metrics. By analyzing these details, we can identify high-performing ad sets and understand their contribution to overall campaign success.
# Aggregate data by creative name and calculate the mean CTR
creative_ctr = df.groupby('creative_name')['click_through_rate'].mean().reset_index()
# Sort the data by CTR in descending order and get the top 20
top_creatives_by_ctr = creative_ctr.sort_values(by='click_through_rate', ascending=False).head(20)
# Plot CTR by creative name
plt.figure(figsize=(14, 6))
sns.barplot(x='creative_name', y='click_through_rate', data=top_creatives_by_ctr, palette='viridis')
plt.title('Top 20 Creatives by Click-Through Rate')
plt.xlabel('Creative Name')
plt.ylabel('CTR (%)')
plt.xticks(rotation=90)
plt.show()
The chart displays the top 20 creative names ranked by click-through rate (CTR). It reveals which creatives achieved the highest engagement rates. By showcasing the CTR for these top-performing creatives, the chart allows for a comparison of their effectiveness in driving clicks relative to impressions. This insight helps identify the most successful creative strategies and can inform future design and targeting decisions.
# Aggregate data by creative_name and calculate total spent, impressions, clicks, and leads
creative_metrics = df.groupby(['creative_name', 'campaign_name', 'adset_name','platform'])[['total_spent', 'impressions', 'clicks', 'leads']].sum().reset_index()
# Sort the data by leads in descending order and get the top 20 creatives
top_20_creatives = creative_metrics.sort_values(by='leads', ascending=False).head(20)
# Plot the top 20 creatives by leads
plt.figure(figsize=(14, 8))
sns.barplot(x='creative_name', y='leads', data=top_20_creatives, palette='viridis', order=top_20_creatives['creative_name'].tolist())
plt.title('Top 20 Creatives by Leads')
plt.xlabel('Creative Name')
plt.ylabel('Leads')
plt.xticks(rotation=90)
plt.show()
The chart presents the top 20 creative names ranked by the number of leads generated. It highlights which creatives were most effective in converting interactions into leads. By showing the lead counts for these top-performing creatives, the chart enables comparison of their effectiveness and helps identify which creative names delivered the best results in lead generation. This information can guide future creative strategies and optimize lead acquisition efforts.
# Display details of the top 20 creatives
print("Details of Top 20 Creatives by Leads:")
print(top_20_creatives)
Details of Top 20 Creatives by Leads:
creative_name \
1147 USP_RIT_1_Getpaid
493 #6 ( Graphic ) - Master's Degree In 12 Months
1 --
1333 USP_kavita&steffi_VideoAd6
940 USP_Influencer_videoAd2
524 #6 ( Graphic ) - Master's Degree In 12 Months
500 #6 ( Graphic ) - Master's Degree In 12 Months
504 #6 ( Graphic ) - Master's Degree In 12 Months
1148 USP_RIT_1_Getpaid
526 #6 ( Graphic ) - Master's Degree In 12 Months
1146 USP_RIT_1_Getpaid
1014 USP_Program_Linkedin_Static_C11_SB-3YWP
652 #8 ( Graphic ) - Study In The USA
795 USP_Drexel_1_EWYL
1324 USP_kavita&steffi_VideoAd6
531 #6 ( Graphic ) - Master's Degree In 12 Months
410 #4 ( Graphic ) - How To Get Admmission In Your...
1277 USP_Webinar_MSunder12Months
532 #6 ( Graphic ) - Master's Degree In 12 Months
941 USP_Influencer_videoAd2
campaign_name \
1147 USP_MS + Coop Campaign_240224
493 CBO - USP Masterclass Cold [ 15th Dec ] - High...
1 USP_Search_APTNTSKA_250124
1333 USP_Marketing+MBA_New Leads Campaign_240224
940 USP_Video_More_Volume_No Logic_India_New Link
524 USP Study Abroad_Generic_240224
500 CBO - USP Masterclass Cold [ 23rd Nov ] - High...
504 CBO - USP Masterclass Cold [ 23rd Nov ] - More...
1148 USP_MS + Coop Campaign_240224
526 USP Study Abroad_Generic_240224
1146 USP_MS + Coop Campaign_240224
1014 US_Pathway_2_Lead_gen_1_091123
652 USP_2_Lead-gen-1_220124
795 USP_MS + Coop Campaign_240224
1324 USP Study Abroad_Generic_240224
531 USP_2_Lead-gen-1_220124
410 USP_2_Lead-gen-1_220124
1277 USP_Webinar_070324
532 USP_2_Lead-gen-1_220124
941 USP_Video_More_Volume_No Logic_India_New Link
adset_name platform \
1147 MS + Coop All India Ad set - 240224 Facebook
493 Open Targeting - Advantage+ audience [ 7th Dec ] Facebook
1 GMAT/GRE Google
1333 Marketing+MBA All India - 240224 Facebook
940 Open Targeting - Advantage+ audience Facebook
524 All India Counselling Look-a-like_290224 Facebook
500 Open Targeting - Advantage+ audience [ 23rd Nov ] Facebook
504 Open Targeting - Advantage+ audience [ 31st Oct ] Facebook
1148 MS + Coop KA&TN Ad set - 240224 Facebook
526 All India Generic Leads Ad Set_240224 Facebook
1146 MS + Coop AP&TS Ad set - 240224 Facebook
1014 -- LinkedIn
652 USP-2_Data_Job-title_India_exclude_KATNTSAP Facebook
795 MS + Coop KA&TN Ad set - 240224 Facebook
1324 All India Generic Leads Ad Set_240224 Facebook
531 USP-2_Data_Job-title_India_exclude_KATNTSAP Facebook
410 USP-2_Data_Job-title_India_exclude_KATNTSAP Facebook
1277 All India Study abroad Interest 070324 Facebook
532 USP-2_Data_Job-title_KATNTSAP Facebook
941 Related To Study In US Facebook
total_spent impressions clicks leads
1147 363179.88 1208355 8025 1567
493 175527.01 590020 12490 1275
1 689360.86 161249 19360 1100
1333 349155.30 3844267 19526 1016
940 50556.92 299432 5106 883
524 264607.03 989837 5357 872
500 87028.19 414751 6244 658
504 215547.83 784476 10155 637
1148 242991.84 858683 3823 607
526 248570.95 1264680 5037 488
1146 247045.75 826908 3447 472
1014 497729.37 1316291 9296 447
652 101672.76 592552 3269 443
795 104773.37 89517 1682 424
1324 198709.93 1662397 8836 418
531 95464.78 377001 2467 399
410 46909.52 8535 295 374
1277 166060.22 923194 3319 368
532 111897.24 498682 2607 367
941 43993.45 345301 2598 358
This comprehensive view allows for an in-depth comparison of top-performing creatives, showing their impact across various metrics and helping to identify the most effective strategies for lead generation.
# Plot average impressions by platform
plt.figure(figsize=(10, 6))
sns.barplot(x='platform', y='impressions', data=df, palette='viridis', estimator=np.mean, ci=None)
plt.title('Average impressions by Platform')
plt.xlabel('Platform')
plt.ylabel('Average impressions')
plt.xticks(rotation=45)
plt.show()
C:\Users\KRUSHNA KOUSHIK\AppData\Local\Temp\ipykernel_4080\1048618588.py:3: FutureWarning: The `ci` parameter is deprecated. Use `errorbar=None` for the same effect. sns.barplot(x='platform', y='impressions', data=df, palette='viridis', estimator=np.mean, ci=None)
The analysis shows that Google ad sets are achieving higher average impressions or reach compared to other platforms. This indicates that Google campaigns are reaching a larger audience on average, suggesting greater visibility and potential engagement.
# Plot leads by platform
plt.figure(figsize=(10, 6))
sns.barplot(x='platform', y='leads', data=df, palette='viridis', estimator=sum, ci=None)
plt.title('Leads by Platform')
plt.xlabel('Platform')
plt.ylabel('Total Leads')
plt.xticks(rotation=45)
plt.show()
C:\Users\KRUSHNA KOUSHIK\AppData\Local\Temp\ipykernel_4080\2669477451.py:3: FutureWarning: The `ci` parameter is deprecated. Use `errorbar=None` for the same effect. sns.barplot(x='platform', y='leads', data=df, palette='viridis', estimator=sum, ci=None)
Total leads by platform shows that facebook have more because it has more campiagns and all.
# Plot average leads by platform
plt.figure(figsize=(10, 6))
sns.barplot(x='platform', y='leads', data=df, palette='viridis', estimator=np.mean, ci=None)
plt.title('Average Leads by Platform')
plt.xlabel('Platform')
plt.ylabel('Average Leads')
plt.xticks(rotation=45)
plt.show()
C:\Users\KRUSHNA KOUSHIK\AppData\Local\Temp\ipykernel_4080\2387000561.py:3: FutureWarning: The `ci` parameter is deprecated. Use `errorbar=None` for the same effect. sns.barplot(x='platform', y='leads', data=df, palette='viridis', estimator=np.mean, ci=None)
The analysis indicates that Google ad sets are consistently generating more average leads than those on other platforms. This higher lead generation suggests that Google’s targeting or ad strategies are more effective in converting potential customers into actual leads. The greater average lead count highlights Google's strong performance in attracting and engaging users compared to other advertising platforms. This insight can guide future ad spend decisions and optimization strategies to leverage Google’s effectiveness in lead generation.
# Count the number of campaigns by platform
campaign_counts_by_platform = df.groupby('platform')['campaign_name'].nunique().reset_index()
campaign_counts_by_platform.columns = ['Platform', 'Campaign Count']
# Plot the count of campaigns by platform
plt.figure(figsize=(12, 6))
sns.barplot(x='Platform', y='Campaign Count', data=campaign_counts_by_platform, palette='viridis')
plt.title('Count of Campaigns by Platform')
plt.xlabel('Platform')
plt.ylabel('Number of Campaigns')
plt.xticks(rotation=45)
plt.show()
Facebook has the highest number of campaigns, resulting in a larger number of campaigns with high lead counts. However, due to the large volume of campaigns, the average leads per campaign are lower. This indicates that while Facebook campaigns are numerous and capable of generating many leads overall, the average performance per campaign is reduced compared to platforms like Google, which may have fewer campaigns but higher average leads.
Distribution and Skewness: The data shows high activity in December, February, and April. Most campaigns have lower total spends, primarily between $100 and $1000. Similarly, impressions are mostly between 100 and 1000, clicks range from 0 to 100, and leads are often between 0 and 5. Campaigns with '0' leads should be re-evaluated or have their budgets reduced.
Platform Insights: Facebook contributes the majority of data points, significantly influencing overall trends. However, Google ad sets outperform other platforms in generating higher average impressions and leads, showing Google's effectiveness in reaching and converting customers.
Campaign and Creative Analysis: Campaigns with more data points tend to have higher reach and engagement. The top 10 campaigns by lead generation show a strong correlation between total spent and leads, indicating efficient budget use. The top 20 creative names by leads highlight effective strategies with detailed metrics such as campaign name, ad set name, total spent, impressions, clicks, and leads.
Correlation and Relationships: The correlation matrix shows that impressions and clicks have a strong direct relationship, while clicks and leads have no significant relationship. Impressions and total spent, as well as leads and total spent, show slight direct relationships, but clicks and total spent have no significant relationship.
Strategic Insights: The analysis identifies the highest financial investments, showing Google campaigns as more effective on average. Facebook campaigns, despite being numerous, have lower average leads per campaign due to volume. Google's superior lead generation should guide future ad spend decisions and optimization strategies.
To optimize campaign performance and maximize ROI, it is essential to focus on improving or reducing the budget for campaigns that have generated '0' leads, ensuring a more efficient use of resources. Given the higher effectiveness of Google campaigns in generating leads, reallocating more budget to these campaigns can enhance overall lead generation. Additionally, it is crucial to enhance strategies for Facebook campaigns to increase their average lead performance, potentially by refining targeting, creative content, and bidding strategies. These insights and conclusions should guide strategic decisions to optimize campaign performance, allocate budgets more effectively, and leverage the unique strengths of each platform for better overall results. By continuously monitoring and adjusting based on performance data, you can ensure sustained improvement and success in your advertising efforts.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Selecting features and target variable
features = ['clicks', 'total_spent', 'click_through_rate', 'impressions', 'creative_name', 'adset_name', 'campaign_name']
target = 'leads'
# Converting categorical variables into dummy variables
df_dummies = pd.get_dummies(df[features], drop_first=True)
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df_dummies, df[target], test_size=0.3, random_state=42)
# Initialize and train the Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
# Make predictions
y_pred_lr = lr_model.predict(X_test)
# Evaluate the model
mae_lr = mean_absolute_error(y_test, y_pred_lr)
mse_lr = mean_squared_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)
print(f"Linear Regression:")
print(f" Mean Absolute Error: {mae_lr}")
print(f" Mean Squared Error: {mse_lr}")
print(f" R² Score: {r2_lr}")
Linear Regression: Mean Absolute Error: 1.117966062167505 Mean Squared Error: 7.253360974857454 R² Score: 0.6144846031324868
Output : The Linear Regression model shows a reasonable predictive performance with an average prediction error of 1.12 leads (MAE) and a manageable average squared error of 7.25 (MSE). The model explains 61.4% of the variance in leads (R²), indicating a moderate to strong fit. Overall, the model effectively captures patterns in lead generation, though there is potential for further improvement.
As Linear Regression did not meet our accuracy expectations, we opted for Random Forest Regression due to its superior ability to handle complex, non-linear relationships. Unlike Linear Regression, which assumes a straight-line relationship between variables, Random Forest Regression leverages an ensemble of decision trees to capture intricate patterns and interactions within the data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Selecting features and target variable
features = ['clicks', 'total_spent', 'click_through_rate', 'impressions', 'creative_name', 'adset_name', 'campaign_name']
target = 'leads'
# Converting categorical variables into dummy variables
df_dummies = pd.get_dummies(df[features], drop_first=True)
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df_dummies, df[target], test_size=0.3, random_state=42)
# Initialize and train the Random Forest Regressor
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)
# Make predictions
y_pred = rf_model.predict(X_test)
# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Random Forest Regressor:")
print(f" Mean Absolute Error: {mae}")
print(f" Mean Squared Error: {mse}")
print(f" R² Score: {r2}")
# Feature Importances
importances = rf_model.feature_importances_
feature_names = X_train.columns
feature_importances = pd.Series(importances, index=feature_names).sort_values(ascending=False)
# Get top 20 features
top_20_features = feature_importances.head(20)
print("Top 20 Feature Importances (Random Forest):")
print(top_20_features)
# Plotting the top 20 feature importances
plt.figure(figsize=(12, 8))
sns.barplot(x=top_20_features, y=top_20_features.index)
plt.title('Top 20 Feature Importances (Random Forest)')
plt.xlabel('Importance Score')
plt.ylabel('Feature')
plt.show()
Random Forest Regressor: Mean Absolute Error: 0.5375668730288518 Mean Squared Error: 3.630365880154187 R² Score: 0.8070464233183396 Top 20 Feature Importances (Random Forest): clicks 0.476358 total_spent 0.214003 click_through_rate 0.151406 impressions 0.033867 creative_name_USP_kavita&steffi_VideoAd6 0.016300 campaign_name_USP Masterclass Cold [ 14th Dec ] - Link Click Ad 0.012111 adset_name_Open Targeting - Advantage+ audience 0.009721 adset_name_Study Abroad 0.007942 adset_name_MS + Coop All India Ad set - 240224 0.006523 creative_name_#4 ( Graphic ) - How To Get Admmission In Your Dream Universities In US 0.005952 campaign_name_USP_MS + Coop Campaign_240224 0.004765 campaign_name_USP_Search_India-ex-APTNTSKA_250124 0.004203 adset_name_All India Counselling Look-a-like_290224 0.003028 adset_name_Open Targeting - Advantage+ audience [ 7th Dec ] 0.002975 creative_name_USP_Drexel_1_EWYL 0.002519 creative_name_USP_RIT_1_Getpaid 0.002288 campaign_name_USP_2_Lead-gen-1_220124 0.002225 campaign_name_USP Masterclass Cold [ 23rd Nov ] - More volume - With Logic - India 0.002072 creative_name_USP_Program_Linkedin_Static_C11_SB-3YWP 0.001965 campaign_name_USP Study Abroad_Generic_240224 0.001601 dtype: float64
Conclusion :
The Random Forest Regressor has demonstrated strong predictive performance with an R² score of 0.81, indicating that it effectively captures 81% of the variance in lead generation. The model’s Mean Absolute Error (0.54) and Mean Squared Error (3.63) suggest that predictions are reasonably accurate with manageable error levels. Key insights reveal that the most influential factors for generating leads are the number of clicks and total spending, highlighting their crucial roles in campaign effectiveness. The click-through rate also plays a significant role, albeit less impactful. These findings suggest that the company should focus on optimizing click volumes and strategic spending to maximize lead generation. Additionally, refining ad creatives and campaigns based on their impact can further enhance performance. Regular model updates and feature analysis will help sustain and improve lead prediction accuracy, guiding more informed decision-making and strategic planning.